Direct Gradient-Based Reinforcement Learning: I. Gradient Estimation Algorithms

Authors

  • Jonathan Baxter
  • Peter L. Bartlett
Abstract

Despite their many empirical successes, approximate value-function based approaches to reinforcement learning suffer from a paucity of theoretical guarantees on the performance of the policy generated by the value function. In this paper we pursue an alternative approach: first compute the gradient of the average reward with respect to the parameters controlling the state transitions in a Markov chain (be they parameters of a class of approximate value functions generating a policy by some form of look-ahead, or parameters directly parameterizing a set of policies), and then use gradient ascent to generate a new set of parameters with increased average reward. We call this method "direct" reinforcement learning because we are not attempting to first find an accurate value function from which to generate a policy; we are instead adjusting the parameters to directly improve the average reward. We present an algorithm for computing approximations to the gradient of the average reward from a single sample path of the underlying Markov chain. We show that the accuracy of these approximations depends on the relationship between the discount factor used by the algorithm and the mixing time of the Markov chain, and that the error can be made arbitrarily small by setting the discount factor suitably close to 1. We extend this algorithm to the case of partially observable Markov decision processes controlled by stochastic policies. We prove that both algorithms converge with probability 1.
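
The abstract describes estimating the gradient of the average reward from a single sample path, using a discount factor whose closeness to 1 trades approximation error (governed by the chain's mixing time) against estimation variance. As a rough illustration only, the sketch below implements a discounted eligibility-trace estimator in that spirit on a toy two-state chain with a tabular softmax policy; the toy dynamics, the softmax parameterization, and all names are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def estimate_gradient(theta, beta=0.95, T=200_000, seed=0):
    """Single-sample-path estimate of the average-reward gradient (sketch).

    An eligibility trace of log-policy gradients is discounted by beta < 1,
    and the running average of reward * trace approximates the gradient of
    the average reward with respect to the policy parameters theta.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = theta.shape
    rewards = np.array([0.0, 1.0])   # toy chain: reward favors state 1
    s = 0
    z = np.zeros_like(theta)         # eligibility trace
    delta = np.zeros_like(theta)     # running gradient estimate
    for t in range(T):
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)
        # gradient of log pi(a | s; theta) for a tabular softmax policy
        grad_log = np.zeros_like(theta)
        grad_log[s] = -probs
        grad_log[s, a] += 1.0
        s_next = a                   # toy transition: action a moves to state a
        r = rewards[s_next]
        z = beta * z + grad_log                 # discounted trace
        delta += (r * z - delta) / (t + 1)      # incremental average
        s = s_next
    return delta

if __name__ == "__main__":
    theta = np.zeros((2, 2))
    print(estimate_gradient(theta))
```

In this sketch, increasing beta toward 1 reduces the bias of the estimate at the cost of higher variance, which is the trade-off the abstract attributes to the relationship between the discount factor and the mixing time.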

Similar Resources

Residual Algorithms: Reinforcement Learning with Function Approximation

A number of reinforcement learning algorithms have been developed that are guaranteed to converge to the optimal solution when used with lookup tables. It is shown, however, that these algorithms can easily become unstable when implemented directly with a general function-approximation system, such as a sigmoidal multilayer perceptron, a radial-basis-function system, a memory-based learning syst...


Regularized Policy Gradients: Direct Variance Reduction in Policy Gradient Estimation

Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces; they update the policy parameters along the direction of steepest ascent of the expected return. However, the large variance of policy gradient estimates often destabilizes the policy update. In this paper, we propose to suppress the variance of gradient estimation by directly employing the var...


Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

Reinforcement learning by direct policy gradient estimation is attractive in theory but in practice leads to notoriously ill-behaved optimization problems. We improve its robustness and speed of convergence with stochastic meta-descent, a gain vector adaptation method that employs fast Hessian-vector products. In our experiments the resulting algorithms outperform previously employed online sto...


Cumulative Prospect Theory Meets Reinforcement Learning: Prediction and Control

Cumulative prospect theory (CPT) is known to model human decisions well, with substantial empirical evidence supporting this claim. CPT works by distorting probabilities and is more general than the classic expected utility and coherent risk measures. We bring this idea to a risk-sensitive reinforcement learning (RL) setting and design algorithms for both estimation and control. The estimation ...


MAP Inference for Bayesian Inverse Reinforcement Learning

The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can b...



Journal:

Volume   Issue

Pages  -

Publication date: 1999